Unmasking Bias: A Red-Teaming Analysis of Racial Disparities in Pretrained Facial Recognition Models¶

Special Topics in AI
By: Jazmin Brown
Completion Date: December 2024

Abstract¶

This study investigates potential racial biases in facial recognition technologies by analyzing the embeddings generated for Black and White individuals. Using a balanced dataset of facial images, we examine the distribution, compactness, and clustering tendencies of embeddings for both racial groups. The findings reveal that Black individuals' embeddings are tighter and more cohesive, with stronger clustering tendencies compared to White individuals, whose embeddings are more dispersed. These differences suggest that the facial recognition model may encode racial groups differently, raising concerns about fairness and accuracy, particularly in high-stakes applications like law enforcement and hiring. While this research provides valuable insights into the model's behavior, it is based on a small sample size and serves as an initial exploration of the topic. Further studies are necessary to understand the broader implications, including examining other racial demographics, expanding the dataset, and investigating how these disparities evolve over time to ensure that facial recognition systems are equitable and reliable.

Introduction: Exploring Racial Disparities in Facial Recognition Systems¶

Facial recognition technology has become an essential tool in a variety of fields, including law enforcement, security, and healthcare, due to its ability to efficiently identify and verify individuals. However, as this technology becomes more widely adopted, concerns about racial bias and fairness have emerged, particularly in its application by law enforcement agencies. A significant body of research has documented how facial recognition systems, including popular implementations such as FaceNet-Pytorch, often perform less accurately when identifying individuals from minority racial groups, especially African Americans. These performance disparities can have serious consequences, such as misidentifications and the amplification of existing racial inequalities in policing.

FaceNet-Pytorch is a widely used deep learning library for face recognition that provides a robust implementation of the FaceNet architecture. This model has shown competitive performance on facial recognition benchmarks and is capable of processing diverse datasets. Although FaceNet-Pytorch has been praised for its accuracy, the racial bias within such systems remains an under-explored area. The model’s ability to perform across various demographics makes it an ideal candidate for investigating the disparities in facial recognition performance across different racial and ethnic groups.

This study aims to investigate racial bias in FaceNet-Pytorch’s facial recognition performance, focusing on whether the system recognizes faces from different racial demographics more accurately using a custom dataset. A custom dataset is particularly useful as it allows for tailored data selection to ensure balanced representation, control over dataset quality and diversity, and the ability to focus on specific research objectives. Additionally, using a custom dataset introduces the possibility that the model may not have been trained on the individuals or characteristics it encounters, providing an opportunity to evaluate how well it generalizes to unseen data. By analyzing the system's performance across these groups, the research seeks to identify disparities that may disproportionately affect one demographic over another. Ultimately, this work aims to contribute to the broader conversation about fairness and accountability in AI, particularly in applications with significant societal implications, such as law enforcement.

Because the model is pre-trained and its weights are frozen during this study, augmentations cannot prevent overfitting in the usual sense; rather, they were applied to the dataset during the testing phase to probe how robustly the model generalizes beyond idealized images. The added transformations include random horizontal flipping, random rotation, and color jittering, which introduce the kind of variability the model would encounter in real-world scenarios. RandomHorizontalFlip() mirrors faces at random, testing recognition under reversed orientation. RandomRotation(15) applies slight rotational variation (up to 15 degrees) to simulate faces that are slightly tilted in real-world images. ColorJitter() perturbs the brightness, contrast, saturation, and hue of the images, simulating varying lighting conditions. Together, these augmentations ensure that the evaluation does not hinge on specific image features, exercising the model's ability to generalize across different racial groups and other variables.
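A minimal sketch of this augmentation pipeline using torchvision is shown below. The Resize target (160x160, the input size expected by InceptionResnetV1) and the ColorJitter magnitudes are illustrative assumptions rather than the study's exact values; note that ColorJitter's parameters default to zero, so nonzero values must be passed for it to have any effect.

```python
# Sketch of the described augmentation pipeline (torchvision).
# The Resize target and ColorJitter magnitudes are illustrative choices.
from torchvision import transforms

augmentations = transforms.Compose([
    transforms.Resize((160, 160)),          # match the model's input size
    transforms.RandomHorizontalFlip(),      # mirror faces at random
    transforms.RandomRotation(15),          # rotate up to +/-15 degrees
    transforms.ColorJitter(brightness=0.2,  # vary lighting; ColorJitter's
                           contrast=0.2,    # parameters default to 0, so
                           saturation=0.2,  # nonzero values are required
                           hue=0.1),        # for any visible effect
    transforms.ToTensor(),                  # PIL image -> float tensor
])
```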

The specific problem addressed in this research is whether the pre-trained InceptionResnetV1 model in FaceNet-Pytorch better represents White or Black individuals in the custom dataset. The study evaluates whether the system produces more consistent or accurate embeddings for one group compared to the other. The analysis measures how well the model’s embeddings reflect intra-group consistency for each racial group, utilizing metrics such as Euclidean distance and cosine similarity. These metrics quantify the degree of similarity or proximity between embeddings within each group. Higher similarity (lower average distance or higher cosine similarity) indicates that the model generates more consistent representations of individuals within that group, whereas lower similarity suggests that the embeddings are more dispersed. If one racial group's embeddings are significantly more consistent or better represented than the other, it may suggest that the model exhibits bias favoring that group. Such an imbalance could indicate racial bias, as it reflects unequal generalization across groups and may lead to disparities in performance or outcomes.
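To make these measures concrete, the sketch below computes the mean pairwise Euclidean distance and cosine similarity within one group's embeddings. The function name and the assumption that embeddings arrive as an (N, 512) NumPy array are illustrative.

```python
# Sketch of the intra-group consistency metrics, assuming `embeddings`
# is an (N, 512) NumPy array for a single racial group.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def intra_group_tightness(embeddings):
    """Mean pairwise Euclidean distance and cosine similarity within a
    group, excluding the self-comparisons on the diagonal."""
    n = len(embeddings)
    off_diagonal = ~np.eye(n, dtype=bool)
    mean_distance = euclidean_distances(embeddings)[off_diagonal].mean()
    mean_similarity = cosine_similarity(embeddings)[off_diagonal].mean()
    return mean_distance, mean_similarity
```

A lower mean distance or a higher mean similarity indicates tighter, more consistent embeddings for that group.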

As these systems become more widespread, their reliance on unbalanced training data and flawed algorithms has led to disproportionate misidentification of individuals from minority racial groups. This bias causes significant harm, resulting in inaccurate identification and reinforcing racial inequalities. The consequences of this bias extend beyond individual cases, contributing to broader societal issues of discrimination and unequal treatment, making it essential to address and mitigate these biases in facial recognition systems.

A Review of Racial Bias Research in Facial Recognition¶

The rise of deep learning and machine learning in facial detection and analysis has brought attention to the cascading impacts of systemic racism and emerging forms of discrimination, primarily driven by unbalanced data practices, such as biased data sampling, collection, labeling, and preprocessing. Studies from the National Institute of Standards and Technology (2002-2019) have highlighted significant racial and gender biases in widely used facial recognition algorithms, with marginalized groups, especially women of color, experiencing the highest misidentification rates and performance drops. This issue is compounded by intersectional discrimination, where overlapping race and gender biases further affect facial analysis technologies. For instance, Microsoft's FaceDetect model showed a 20.8% error rate for dark-skinned women, compared to a 0% error rate for light-skinned men (Leslie, 2020).

Racial bias in face recognition systems is intensified by reliance on oversimplified racial categories influenced by societal prejudices. Skin tone is just one aspect of a complex race concept, and a broader approach that includes facial phenotypes and historical context is needed for fairer evaluations. Bias amplification occurs throughout the recognition pipeline, starting with image acquisition and continuing through each technical stage. Despite this, research on early-stage bias in face detection remains limited. Existing evaluation strategies often overlook these biases, highlighting the need for systems that consider both facial traits and broader social factors (Yucer et al., 2024).

Facial recognition technology is becoming more prevalent in law enforcement for tasks like surveillance and identifying suspects. However, these systems often rely on databases that disproportionately include African Americans, a reflection of systemic biases in arrest and incarceration rates. Although marketed as accurate, studies reveal that these systems tend to perform poorly on African American faces, leading to higher rates of misidentification. This bias is amplified by the lack of diversity in training datasets and the overrepresentation of African Americans in law enforcement databases. As a result, innocent individuals are at risk of being wrongly flagged, exacerbating existing racial inequalities in policing. To address these concerns, experts advocate for mandatory bias testing, independent assessments, and the creation of a certification system to ensure these systems are fair and accurate in law enforcement applications (Garvie and Frankle, 2016).

Face recognition technology has become increasingly popular, particularly in security systems, due to its ability to identify and re-detect faces under diverse conditions. Numerous methods have been proposed to enhance accuracy, with FaceNet standing out as a recent and effective advancement. In "Face Recognition using FaceNet (Survey, Performance Test, and Comparison)", I. William et al. (2019) explored FaceNet's performance and demonstrated its superiority over other methods. Based on a deep convolutional network and triplet-loss training, FaceNet significantly reduces training time by integrating TensorFlow and using pre-trained models. The study tested FaceNet with datasets such as CASIA-WebFace and VGGFace2 and found it outperformed other methods, achieving near-perfect accuracy on datasets like YALE, JAFFE, and AT&T, and a maximum of 99.375% accuracy on the Essex Faces94 dataset, underscoring FaceNet's effectiveness in face recognition tasks and its potential for diverse applications (I. William et al., 2019).

The growing demand for security systems, driven by reduced installation and storage costs, has led to the widespread use of video surveillance and digital authentication technologies. However, traditional human-monitored systems are prone to errors and difficult to scale. A study by A. F. S. Moura et al. (2020) evaluated the FaceNet approach using the Labeled Faces in the Wild benchmark and compared it with a machine learning technique known as support vector machine (SVM) for classifying FaceNet-generated embeddings. The study demonstrated that combining FaceNet with SVM achieved 90% accuracy in a real-time facial recognition system using a medium-quality webcam. This approach highlighted the effectiveness of integrating FaceNet with other machine learning techniques for practical applications in video surveillance and security (A. F. S. Moura et al., 2020).

Motivation: Advancing Fairness in AI and Justice¶

As a graduate student with a strong interdisciplinary background in criminal justice, forensics, cybersecurity, and data science, my passion lies at the intersection of technology and justice. The rapid advancement of AI and facial recognition technology has transformed numerous industries, offering innovative solutions for identification and security. In law enforcement, these systems are increasingly employed to enhance public safety and streamline investigations. However, racial bias in facial recognition remains a pressing concern, as these systems often disproportionately misidentify individuals from marginalized racial groups. Such inaccuracies can lead to wrongful outcomes, reinforce systemic inequalities, and undermine trust in the criminal justice system.

This study seeks to shed light on these critical challenges and advocate for the development of facial recognition technologies that are fair, equitable, and accountable, particularly in high-stakes applications with significant societal impact. While this research provides only a sliver of analysis, it serves as an essential starting point for a broader exploration into the ethical implementation of these powerful technologies.

Resourceful Planning¶

Semester-Long Research Project Plan¶

This semester-long research project is estimated to cost approximately $10,000, covering essential resources such as software tools, dataset sourcing, cloud infrastructure, and personnel costs. The project team will consist of the lead researcher, a data scientist, and one research assistant, each with clearly defined roles to ensure efficient collaboration on data preparation, testing, and report compilation.

Project Phases¶

1. Planning and Setup (Month 1)¶
  • Define project objectives, deliverables, and success metrics, ensuring the scope remains focused.
  • Assemble and prepare the necessary tools, datasets, and cloud infrastructure.
  • Identify potential risks (e.g., data quality or computational limitations) and develop mitigation strategies.
2. Testing and Evaluation (Months 2-3)¶
  • Systematically pass samples from the custom dataset through the pre-trained FaceNet model.
  • Use established metrics to evaluate the model’s performance across racial groups, focusing on intra-group consistency and accuracy.
  • Conduct regular progress reviews to ensure the analysis remains aligned with the project’s goals.
3. Analysis and Reporting (Month 4)¶
  • Analyze the results, focusing on identifying any disparities in the model's performance across racial demographics.
  • Compile findings into a comprehensive final report, including key insights, visualizations, and actionable recommendations for future research and development.
  • Prepare deliverables for potential presentation at conferences or academic publication.

Guiding Principles¶

The project will be guided by best practices, including:

  • Frequent Progress Monitoring: Regular check-ins to track milestones and adjust plans as needed.
  • Risk Management: Proactively addressing potential challenges to minimize disruptions.
  • Open Communication: Ensuring transparency and collaboration among team members.
  • Thorough Documentation: Maintaining detailed records of methods, results, and insights for reproducibility.

Success Criteria¶

Success will be measured by:

  • The clarity and rigor of the performance analysis.
  • Contributions to ongoing discussions about fairness and accountability in AI.

This study represents a foundational step in addressing racial bias in facial recognition, with the potential to inform broader efforts toward equitable AI systems.

A Technical Dive into Bias Evaluation¶

[Figure 1]¶

The figure above presents the distribution of embeddings by racial group, highlighting the number of embeddings for each group. Specifically, there are 50 embeddings for White individuals and 50 embeddings for Black individuals. This equal representation ensures a balanced comparison between the two groups in subsequent analyses. The bar plot clearly distinguishes the two groups, with White embeddings represented in blue and Black embeddings in red. The y-axis shows the number of embeddings for each group, emphasizing the uniformity in the dataset for both racial categories. By visualizing the distribution of embeddings, we can ensure that both racial groups are adequately represented, which is essential for identifying any disparities or biases in the facial recognition model’s performance.
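For reference, the sketch below shows how the embeddings analyzed here can be produced with facenet-pytorch. The 'vggface2' pretrained weights and the dummy input batch are assumptions for illustration; in the study, each input would be a cropped and aligned face image.

```python
# Sketch of embedding extraction with facenet-pytorch. The 'vggface2'
# weights and the random dummy batch are illustrative.
import torch
from facenet_pytorch import InceptionResnetV1

model = InceptionResnetV1(pretrained='vggface2').eval()

# In practice, `faces` would be a batch of preprocessed 160x160 face crops.
faces = torch.randn(4, 3, 160, 160)  # dummy batch for illustration

with torch.no_grad():
    embeddings = model(faces)  # shape (4, 512); embeddings are L2-normalized

print(embeddings.shape)  # torch.Size([4, 512])
```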

Both metrics agree: Black embeddings form tighter clusters, with lower average Euclidean distance and higher average cosine similarity.

[Figure 2]¶

This plot illustrates the distribution of embedding tightness across racial groups using both Euclidean and cosine metrics. The Euclidean distances indicate that embeddings for the White group are more dispersed (mean: 1.20, standard deviation: 0.06) than those for the Black group (mean: 1.09, standard deviation: 0.09). The cosine similarities point the same way: embeddings for the Black group are tighter (mean: 0.42, standard deviation: 0.12) than those for the White group (mean: 0.24, standard deviation: 0.10). Together, the metrics provide a comparative view of embedding compactness and variability between the two groups.

[Figure 3]¶

The boxplot in Figure 3 displays the distribution of tightness values for the White and Black groups under both the Euclidean and cosine metrics. It visualizes the spread and central tendency of tightness within each group, allowing a comparison of intra-group consistency. The plot uses color to distinguish the two racial groups (blue for White, red for Black) and shows how the tightness values differ across the two metrics.
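A minimal sketch of how such a boxplot can be produced with seaborn follows; the long-form DataFrame layout and the sample values (chosen only to be consistent with the means reported for Figure 2) are assumptions.

```python
# Sketch of the Figure 3 boxplot. The DataFrame layout and the sample
# tightness values are illustrative (consistent with the reported means).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'group':     ['White', 'White', 'Black', 'Black'] * 2,
    'metric':    ['Euclidean'] * 4 + ['Cosine'] * 4,
    'tightness': [1.18, 1.25, 1.05, 1.12, 0.22, 0.28, 0.40, 0.45],
})

sns.boxplot(data=df, x='metric', y='tightness', hue='group',
            palette={'White': 'blue', 'Black': 'red'})
plt.title('Embedding tightness by racial group and metric')
plt.show()
```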

Average proportion of same-race neighbors for White individuals: 0.70
Average proportion of same-race neighbors for Black individuals: 0.73

[Figure 4]¶

The nearest neighbor analysis reveals the average proportion of same-race neighbors for White and Black individuals based on their embeddings. The analysis examines each individual's k nearest neighbors (k=5) in the embedding space and measures how often those neighbors belong to the same racial group. For White individuals, the average proportion of same-race neighbors is 0.70, meaning 70% of the nearest neighbors are also White. For Black individuals, the average proportion is 0.73, indicating that 73% of their nearest neighbors are Black. This suggests that the embeddings for both racial groups tend to cluster with others from the same group, with Black individuals showing a slightly higher proportion of same-race neighbors than White individuals.
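The sketch below shows one way to compute these proportions with scikit-learn; the function name and the array layout for the labels are assumptions.

```python
# Sketch of the nearest-neighbor analysis, assuming `embeddings` is an
# (N, 512) NumPy array and `labels` an (N,) NumPy array of group names.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def same_race_neighbor_rate(embeddings, labels, k=5):
    """Average proportion of same-race neighbors among each point's
    k nearest neighbors, reported per racial group."""
    # Request k+1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, indices = nn.kneighbors(embeddings)
    neighbor_labels = labels[indices[:, 1:]]    # drop the self-match
    same = neighbor_labels == labels[:, None]   # (N, k) boolean matrix
    per_person = same.mean(axis=1)              # proportion per individual
    return {g: float(per_person[labels == g].mean())
            for g in np.unique(labels)}
```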

Findings from Red-Teaming Analysis¶

The analyses reveal critical insights into the behavior of the facial recognition model across racial groups, highlighting disparities in embedding distribution, tightness, and clustering tendencies. By ensuring equal representation of 50 embeddings each for White and Black individuals (Figure 1), the dataset establishes a balanced foundation for fair comparison. This uniformity minimizes the risk of biased outcomes due to unequal representation, enabling a more accurate assessment of disparities in the model's performance. The bar plot in Figure 1 visually distinguishes the two groups, using blue for White and red for Black, with the y-axis emphasizing the uniformity of the dataset. This balance ensures that any disparities identified in subsequent analyses are a result not of data imbalance but of intrinsic differences in the embedding space.

Further examination of embedding tightness using both Euclidean and cosine metrics reveals marked differences in compactness between racial groups (Figure 2). The Euclidean distances indicate that embeddings for White individuals are more dispersed, with a mean of 1.20 and a standard deviation of 0.06, whereas the embeddings for Black individuals are more tightly clustered, with a lower mean of 1.09 and a higher standard deviation of 0.09. The cosine similarity metric points the same way once the direction of the scale is accounted for: embeddings for Black individuals are more compact, with a mean similarity of 0.42 and a standard deviation of 0.12, compared to the White group's mean similarity of 0.24 and standard deviation of 0.10. Both metrics therefore indicate that Black individuals' embeddings are generally tighter and more cohesive, while those for White individuals are more dispersed.

The boxplot in Figure 3 builds on this analysis, visually comparing the spread and central tendency of tightness values across racial groups and metrics. For both the Euclidean and cosine metrics, the Black group exhibits narrower interquartile ranges, indicating greater intra-group uniformity, and its median tightness values vary less across the two metrics than the White group's, which show greater variability. These results reinforce the observation that Black individuals' embeddings are more consistent and compact, further highlighting disparities in how the model encodes individuals from different racial groups.

The nearest neighbor analysis (Figure 4) adds another dimension to these findings by evaluating clustering tendencies in the embedding space. For White individuals, an average of 70% of the nearest neighbors belong to the same racial group, while for Black individuals, this figure rises slightly to 73%. These results indicate that embeddings for both racial groups tend to cluster predominantly with individuals of the same race, but Black individuals exhibit a marginally stronger same-group clustering tendency. This difference may suggest subtle variations in the model's treatment of racial groups in terms of embedding space organization.

Future Directions and Implications¶

This research offers valuable insights into the potential racial bias present in facial recognition models, but several avenues remain for further investigation and enhancement. A key direction for future work is expanding the dataset to increase its diversity and accuracy. Currently, the dataset primarily consists of individuals from specific demographic backgrounds, which may lead to bias due to the overrepresentation of high-status individuals or celebrities. To address this, future studies could incorporate volunteers from diverse backgrounds, including students and staff from universities or other community-based groups. This approach would help mitigate the risk of pretrained models being overly influenced by the high status and widespread recognition of certain individuals, ensuring more representative data.

Additionally, this research could benefit from extending the study to a broader range of racial demographics beyond Black and White individuals. Studying more racial and ethnic groups would provide a more comprehensive view of how facial recognition systems interact with diverse populations. Furthermore, ongoing research over a longer period would allow for the inclusion of new datasets as they become available, providing an opportunity to monitor how facial recognition systems evolve over time, particularly as models are modified and updated. By re-evaluating the model regularly, researchers can ensure it remains relevant and continues to provide accurate, fair results across different demographics.

Another important future direction involves examining the significance of the discrepancies in the metrics (such as tightness and similarity) between racial groups. Further investigation into the underlying causes of these differences can uncover systemic biases within the model or dataset, and help to develop strategies for improving fairness and accuracy. This deeper analysis may include more advanced statistical methods or machine learning techniques to quantify and address disparities in facial recognition performance, ensuring that the model is equally effective for all racial groups.

In conclusion, the future of this research lies in continually improving dataset diversity, incorporating real-time data and volunteer contributions, and exploring the underlying factors contributing to observed discrepancies. These steps will help refine facial recognition models, promoting fairness and equity in applications where these technologies are deployed, such as law enforcement and security.

Discussion¶

This analysis highlights significant differences in how the facial recognition model encodes embeddings for Black and White individuals, despite an evenly balanced dataset. Specifically, embeddings for Black individuals are tighter and more cohesive, with stronger clustering tendencies, while those for White individuals are more dispersed. Additionally, Black individuals demonstrate a slightly higher proportion of same-race neighbors, suggesting a stronger propensity for clustering within their group.

These findings indicate that the model may encode racial groups differently, which could have implications for its fairness and accuracy. The observed disparities raise concerns about potential biases in the model’s behavior, particularly in high-stakes applications such as law enforcement and hiring. However, given the small sample size and the preliminary nature of this study, these results should be considered as a starting point for further exploration. To gain a more comprehensive understanding, it is essential to expand this study beyond this initial analysis. Such research will be instrumental in ensuring that facial recognition systems are developed and deployed in a way that is fair, equitable, and free from bias, particularly in high-stakes environments where the consequences of misidentification can be severe.

Links to References & Data Sources¶

  • Leslie, D. (2020). Understanding bias in facial recognition technologies.

  • Yucer, S., et al. (2024). Racial Bias within Face Recognition: A Survey.

  • Garvie, C., and Frankle, J. (2016). Facial-Recognition Software Might Have a Racial Bias Problem.

  • William, I., et al. (2019). Face Recognition using FaceNet (Survey, Performance Test, and Comparison).

  • Moura, A. F. S., et al. (2020). Video Monitoring System using Facial Recognition: A Facenet-based Approach.

  • Black & White Faces Dataset.